Author: Vu Tran. Other info is on GitHub.

Kaggle Competition: San Francisco Crime Classification

  • Info from Competition Site
    • Description
    • Evaluation
    • Data Set
  • First attempt:
    • Working with data
    • Features 'Dates' and 'DayOfWeek'
    • Training Naive Bayes
    • Predicting with Naive Bayes
    • Preparing for kaggle submission
    • Performance Evaluation
      • Splitting train data set
      • Evaluating performance using the split data set
      • Plotting ROC curve
    • Hyperparameters
    • Other improvements
  • Second attempt (in progress)

Info from Competition Site

Description

Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.

Evaluation

Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where N is the number of cases in the test set, M is the number of class labels, log is the natural logarithm, y_{ij} is 1 if observation i is in class j and 0 otherwise, and p_{ij} is the predicted probability that observation i belongs to class j.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$.
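For reference, the metric is easy to reproduce. The sketch below is my own illustration (not Kaggle's scoring code) and simply follows the description above: each row is renormalized to sum to one and the probabilities are clipped before taking logs.

import numpy as np

def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """Multi-class log loss as described above.

    y_true : array of shape (N,) with integer class indices 0..M-1
    y_prob : array of shape (N, M) with predicted probabilities
    """
    y_prob = np.asarray(y_prob, dtype=float)
    # rescale each row so the probabilities sum to one
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    # clip to avoid log(0) at the extremes
    y_prob = np.clip(y_prob, eps, 1 - eps)
    n = y_prob.shape[0]
    # pick out the predicted probability of the true class for each row
    return -np.sum(np.log(y_prob[np.arange(n), y_true])) / n

scikit-learn's sklearn.metrics.log_loss computes the same quantity and can be used as a cross-check.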

Submission Format

You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header and should look like the following:

Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.9,0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
... etc.
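A file in this format can be assembled with pandas once a model produces a probability matrix. In the sketch below, probs (an N x 39 array of predicted probabilities) and classes (the 39 category names in matching column order, e.g. list(model.classes_) for a scikit-learn classifier) are assumed placeholders, not variables defined elsewhere in this notebook.

import numpy as np
import pandas as pd

# probs: (N, 39) array of predicted probabilities, one column per class
# classes: list of the 39 category names in the same column order
submission = pd.DataFrame(probs, columns=classes)
submission.insert(0, 'Id', np.arange(len(submission)))  # incident ids 0..N-1
submission.to_csv('submission.csv', index=False)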

Data Set

Data Files

File Name Available Formats
test.csv .zip (18.75 mb)
sampleSubmission.csv .zip (2.38 mb)
train.csv .zip (22.09 mb)

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set.

Data fields

Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident 
X - Longitude
Y - Latitude

First attempt

Working with data


In [1]:
import pandas as pd
import zipfile

#reading train dataset:
archive=zipfile.ZipFile("C:/Users/vutran/Desktop/github/kaggle/San Francisco Crime Classification/data/train.csv.zip",'r')
train_data=pd.read_csv(archive.open("train.csv"))
train_data.head()


Out[1]:
Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y
0 2015-05-13 23:53:00 WARRANTS WARRANT ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.425892 37.774599
1 2015-05-13 23:53:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.425892 37.774599
2 2015-05-13 23:33:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.424363 37.800414
3 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Wednesday NORTHERN NONE 1500 Block of LOMBARD ST -122.426995 37.800873
4 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Wednesday PARK NONE 100 Block of BRODERICK ST -122.438738 37.771541

In [2]:
train_data.tail()


Out[2]:
Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y
878044 2003-01-06 00:15:00 ROBBERY ROBBERY ON THE STREET WITH A GUN Monday TARAVAL NONE FARALLONES ST / CAPITOL AV -122.459033 37.714056
878045 2003-01-06 00:01:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Monday INGLESIDE NONE 600 Block of EDNA ST -122.447364 37.731948
878046 2003-01-06 00:01:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO Monday SOUTHERN NONE 5TH ST / FOLSOM ST -122.403390 37.780266
878047 2003-01-06 00:01:00 VANDALISM MALICIOUS MISCHIEF, VANDALISM OF VEHICLES Monday SOUTHERN NONE TOWNSEND ST / 2ND ST -122.390531 37.780607
878048 2003-01-06 00:01:00 FORGERY/COUNTERFEITING CHECKS, FORGERY (FELONY) Monday BAYVIEW NONE 1800 Block of NEWCOMB AV -122.394926 37.738212

In [3]:
train_data.dtypes


Out[3]:
Dates          object
Category       object
Descript       object
DayOfWeek      object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

In [4]:
train_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 67.0+ MB

In [11]:
pd.unique(train_data.Category)


Out[11]:
array(['WARRANTS', 'OTHER OFFENSES', 'LARCENY/THEFT', 'VEHICLE THEFT',
       'VANDALISM', 'NON-CRIMINAL', 'ROBBERY', 'ASSAULT', 'WEAPON LAWS',
       'BURGLARY', 'SUSPICIOUS OCC', 'DRUNKENNESS',
       'FORGERY/COUNTERFEITING', 'DRUG/NARCOTIC', 'STOLEN PROPERTY',
       'SECONDARY CODES', 'TRESPASS', 'MISSING PERSON', 'FRAUD',
       'KIDNAPPING', 'RUNAWAY', 'DRIVING UNDER THE INFLUENCE',
       'SEX OFFENSES FORCIBLE', 'PROSTITUTION', 'DISORDERLY CONDUCT',
       'ARSON', 'FAMILY OFFENSES', 'LIQUOR LAWS', 'BRIBERY',
       'EMBEZZLEMENT', 'SUICIDE', 'LOITERING', 'SEX OFFENSES NON FORCIBLE',
       'EXTORTION', 'GAMBLING', 'BAD CHECKS', 'TREA', 'RECOVERED VEHICLE',
       'PORNOGRAPHY/OBSCENE MAT'], dtype=object)

In [12]:
pd.unique(train_data.Category).shape


Out[12]:
(39L,)

Now that we have a general idea of the data set, we next clean and transform the data to create useful features for machine learning.

Features 'Dates' and 'DayOfWeek'

The feature Dates includes both the date and the time of the incident; I shall only use the hour of the time.


In [51]:
feature_hour=pd.to_datetime(train_data.Dates).dt.hour
pd.unique(feature_hour)


Out[51]:
array([23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10,  9,  8,  7,
        6,  5,  4,  3,  2,  1,  0], dtype=int64)

In [ ]:
# map day-of-week names to integers 0-6
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6
}
feature_dayofweek = train_data.DayOfWeek.map(dow)
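As a tentative next step (the actual model training follows in the Naive Bayes sections), the hour and day-of-week features could be one-hot encoded so that a Bernoulli Naive Bayes model can consume them. The sketch below is only an assumption about how the pieces might be combined, not the final feature set.

# combine the extracted features into one frame and one-hot encode them;
# get_dummies turns each distinct hour/day value into a binary column
features = pd.DataFrame({
    'hour': feature_hour,
    'dayofweek': feature_dayofweek
})
X = pd.get_dummies(features, columns=['hour', 'dayofweek'])
y = train_data.Category
print(X.shape, y.shape)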